The development of new corpora for under-resourced languages using data available for well-resourced ones
Abstract
In this paper, we propose exploiting existing corpora of well-resourced languages as a basis for developing similar corpora for under-resourced ones. Constructing such corpora will make it possible to find common patterns in the acoustic manifestation of similar functional states regardless of language. Analyzing these corpora will also make it possible to investigate universal and language-specific features reflected in speech. Two pilot experiments that may contribute to the proposed strategy are presented.
Similar resources
Collaboratively Annotating Multilingual Parallel Corpora in the Biomedical Domain―some MANTRAs
The coverage of multilingual biomedical resources is high for the English language, yet sparse for non-English languages—an observation which holds for seemingly well-resourced, yet still dramatically low-resourced ones such as Spanish, French or German but even more so for really under-resourced ones such as Dutch. We here present experimental results for automatically annotating parallel corp...
Extraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée. (Extraction a parallel corpus for machine translation from and to under-resourced languages)
Nowadays, machine translation achieves good results for several language pairs such as English–French, English–Chinese, and English–Spanish. Empirical translation, particularly statistical machine translation, allows us to quickly build a translation system if adequate data is available, because statistical machine translation is based on models trained from large parallel b...
Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation
The lack of sufficient linguistic resources and parallel corpora for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The solution proposed in this paper is to exploit the fact that non-parallel bi- or multilingual text resources are much more widely available than parallel translation data. This position paper presents previous resea...
Lexicon+TX: rapid construction of a multilingual lexicon with under-resourced languages
Most efforts at automatically creating multilingual lexicons require input lexical resources with rich content (e.g. semantic networks, domain codes, semantic categories) or large corpora. Such material is often unavailable and difficult to construct for under-resourced languages. In some cases, particularly for some ethnic languages, even unannotated corpora are still in the process of collect...
Cross-language F0 modeling for under-resourced tonal languages: a case study on Thai-Mandarin
This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large amounts of training data, which are lacking for many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of one under-re...